ggmotif: An R Package for the extraction and visualization of motifs from MEME software

MEME (Multiple Em for Motif Elicitation) is the most commonly used tool to identify motifs within deoxyribonucleic acid (DNA) or protein sequences. However, the results generated by the MEMEare saved using file formats .xml and .txt, which are difficult to read, visualize, or integrate with other widely used phylogenetic tree packages, such as ggtree. To overcome this problem, we developed the ggmotif R package, which provides two easy-to-use functions that can facilitate the extraction and visualization of motifs from the results files generated by the MEME. ggmotif can extract the information of the location of motif(s) on the corresponding sequence(s) from the .xml format file and visualize it. Additionally, the data extracted by ggmotif can be easily integrated with the phylogenetic data. On the other hand, ggmotif can obtain the sequence of each motif from the .txt format file and draw the sequence logo with the function ggseqlogo from the ggseqlogo R package. The ggmotif R package is freely available (including examples and vignettes) from GitHub at https://github.com/lixiang117423/ggmotif or from CRAN at https://CRAN.R-project.org/package=ggmotif.


Introduction
Motifs are regions (or subsequences) of deoxyribonucleic acid (DNA) or protein sequences with a specific structure. These motifs generally contain functionally important sequences and are therefore often used as a base to classify proteins.
Two main approaches can be used to search and discover motifs: the enumerative approach and the probabilistic approach [1]. The MEME utilizes a probabilistic approach and is one of the most widely used software programs for the identification of novel "sequences" in sets of biological sequences [2][3][4][5][6]. The results generated by MEME command line software include figures of each motif and three other files, an .xml file, a .txt file, and an .html file. The .html file is a nice output for visual inspection of the results. However, the user cannot freely combine the figures they want from the .html file and directly generate them with MEME. The .txt file and .xml file contain almost all the information of the .html file. However, these files are difficult to read and cannot be used to directly generate figures and tables that meet the quality demands of publication standards. Some R/Bioconductor packages have been developed to process the results from MEME [7][8][9]. However, these R/Bioconductor R packages cannot parse the position of each motif on the corresponding sequences from the latest MEME. For example, the functions read_meme and importMeme from the Bioconductor packages universalmotif and memes, respectively, can only parse the information from the .txt file and cannot extract the position information. Their result is a list that is slightly unfriendly for the user. Additionally, the results produced by the MEME are difficult to combine with other downstream analysis software, such as ggtree [10]. The Bioconductor package motifstack can plot stacks with a hierarchical tree of the corresponding motifs [9]. However, in many cases, the user wants to visualize the phylogenetic tree with the corresponding position of each motif on the corresponding sequences. Therefore, in this study, we developed the easy-to-use R package ggmotif to facilitate the extraction and visualization of results from the multiple file formats generated by the MEME and facilitate the integration of the data with other phylogenetic visualization tree packages, such as ggtree.

Obtaining motif information from files generated by MEME
The results files generated by the MEME were stored using the .xml file and .txt file formats. The .xml format file contains the motifs' information on the input sequence(s), and the .txt format file is used to store the sequences of all identified motif(s). The function getMotifFrom-MEME can be used to extract motif information from the .xml and .txt files by simply inputting the .xml format file or .txt format file to obtain the dataframe containing the corresponding information (S1 and S2 Tables). The result extracted from the .xml format file included a set of information, such as the id and length of input sequence(s), content for the motif, start position, end position, and p value, and so on. The user can filter motif(s) by some parameter, for instance, motif id, p value, e-value, and/or Bayes threshold. The other information from the .txt format file mainly included motif id and sequence. Users can filter or select motif(s) of interest for visualization. The functions importMeme from the Bioconductor package memes can parse information from the .txt file. However, the extracted table didn't contain the location of motifs on corresponding sequences (S3 Table). The widely used Bioconductor packages memes or universalmotif both cannot handle the .xml file.

Visualization of motif location on sequences
After obtaining the motif information, the user can then visualize the sequence logos or the location of the motifs on the corresponding sequences. Conversely, the function motifLocation can be used to visualize the location of the motif(s) on the corresponding sequence(s) (Fig 2). The reference code is shown below: library(ggmotif) filepath <-system.file("examples", "meme.xml", package = "ggmotif") motif_extract <-getMotifFromMEME(data = filepath, format = "xml") motif_plot <-motifLocation(data = motif_extract) motif_plot

Visualization of motif location on sequences with a phylogenetic tree
If the user has the corresponding phylogenetic tree, the id(s) is the same as the id(s) in the sequences used to identify motifs using MEME, the function motifLocation with parameter tree can be used to visualize the location of motif(s) on the corresponding sequence(s) after sorting the sequence(s) order with a phylogenetic tree. For example, as shown in Fig 3A, AT4G36920.2 is the id of the sequence used to identify motifs using MEME and construct the phylogenetic tree. So do other sequences. The figure generated by the parameter tree is the same as the above, but the order of sequences(s) is adjusted to fit with the phylogenetic tree ( Fig 3A). The id of the phylogenetic tree requires that all formats supported by Bioconductor package ggtree must be consistent with the sequences inputted to MEME. If other information of phylogenetic tree is available, the parameter tree.anno can be used to plot the tip-point of phylogenetic tree ( Fig 3B). As shown in demo result (Fig 3B), when the motifs positions are similar, the genetic relationship between the sequences is also similar. The reference code is shown below: library(ggmotif) filepath <-system.file("examples", "meme.xml", package = "ggmotif") treepath <-system.file("examples", "ara.nwk", package = "ggmotif") motif_extract <-getMotifFromMEME(data = filepath, format = "xml")

Conclusion
The search and characterization of motifs for the given sequence(s) provides valuable insights into their role in biological regulation. Unlike previously developed tools, such as TBtools [16], ggmotif gives the user familiar with command line tool R a chance to extract and visualize motif information on the corresponding sequence(s). The Bioconductor package motifStack can plot motifs using tree-like structures, but it cannot visualize motif locations on corresponding sequences with phylogenetic trees. In other tools, it is difficult for the user to filter or select the motif of interest to visualize, and the user must almost handle the whole process if the motif of interest is changed. Overall, ggmotif is a user-friendly R package that provides functions to extract and visualize motif information from MEME.
Supporting information S1 Table. The extracted information from the .xml format file by the function getMotif-FromMEME. (XLSX)